Welcome to the project on PCA and t-SNE. In this project, we will be using the auto-mpg dataset.
Shifting market conditions, globalization, cost pressure, and volatility are changing the automobile market landscape. The growing availability of data, combined with machine learning, is helping automobile companies drive operational and business transformation.
The automobile market is vast and diverse, with numerous vehicle categories being manufactured and sold with varying configurations of attributes such as displacement, horsepower, and acceleration. We aim to find combinations of these features that can clearly distinguish certain groups of automobiles from others through this analysis, as this will inform other downstream processes for any organization aiming to sell each group of vehicles to a slightly different target audience.
You are a Data Scientist at SecondLife which is a leading used car dealership with numerous outlets across the US. Recently, they have started shifting their focus to vintage cars and have been diligently collecting data about all the vintage cars they have sold over the years. The Director of Operations at SecondLife wants to leverage the data to extract insights about the cars and find different groups of vintage cars to target the audience more efficiently.
The objective of this problem is to explore the data, reduce the number of features by using dimensionality reduction techniques like PCA and t-SNE, and extract meaningful insights.
There are 8 variables in the data: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, and car name.
Brief Insight about the Dataset:
The dataset contains information about various vintage cars, with each row representing a car and each column capturing specific characteristics of the cars. These features include factors like miles per gallon (mpg), number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and car name. Since this is an unsupervised learning task, there is no target variable, and all the features are considered independent. The goal is to explore and group the cars based on these characteristics to uncover meaningful patterns and insights.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To compute distances
from scipy.spatial.distance import cdist, pdist
# To perform K-Means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To import DBSCAN and Gaussian Mixture
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
import warnings
warnings.filterwarnings("ignore")
# Specify the path to your CSV file
file_path = '/Users/naazafreen/Desktop/Data Sciencelearning MIT Program/auto-mpg.csv'
# Load the CSV file into a DataFrame
df = pd.read_csv(file_path)
Purpose: This is the process of exploring and getting familiar with the dataset.
# Basic overview
print("First few rows:")
print(df.head())
# shape function returns the number of rows and columns in the dataset.
print(" Total number of rows and columns:")
df.shape
First few rows:
mpg cylinders displacement horsepower weight acceleration model year \
0 18.0 8 307.0 130 3504 12.0 70
1 15.0 8 350.0 165 3693 11.5 70
2 18.0 8 318.0 150 3436 11.0 70
3 16.0 8 304.0 150 3433 12.0 70
4 17.0 8 302.0 140 3449 10.5 70
car name
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino
Total number of rows and columns:
(398, 8)
print("\n Unique Vintage Car:")
df['car name'].nunique()
Unique Vintage Car:
305
df['car name'].value_counts()
car name
ford pinto 6
toyota corolla 5
amc matador 5
ford maverick 5
chevrolet chevette 4
..
chevrolet monza 2+2 1
ford mustang ii 1
pontiac astro 1
amc pacer 1
chevy s-10 1
Name: count, Length: 305, dtype: int64
print("\nDescriptive Statistics:")
print(df.describe(include='all').T)
Descriptive Statistics:
count unique top freq mean std min \
mpg 398.0 NaN NaN NaN 23.514573 7.815984 9.0
cylinders 398.0 NaN NaN NaN 5.454774 1.701004 3.0
displacement 398.0 NaN NaN NaN 193.425879 104.269838 68.0
horsepower 398 94 150 22 NaN NaN NaN
weight 398.0 NaN NaN NaN 2970.424623 846.841774 1613.0
acceleration 398.0 NaN NaN NaN 15.56809 2.757689 8.0
model year 398.0 NaN NaN NaN 76.01005 3.697627 70.0
car name 398 305 ford pinto 6 NaN NaN NaN
25% 50% 75% max
mpg 17.5 23.0 29.0 46.6
cylinders 4.0 4.0 8.0 8.0
displacement 104.25 148.5 262.0 455.0
horsepower NaN NaN NaN NaN
weight 2223.75 2803.5 3608.0 5140.0
acceleration 13.825 15.5 17.175 24.8
model year 73.0 76.0 79.0 82.0
car name NaN NaN NaN NaN
# Year of manufacture (to define vintage ranges)
# Get the unique model years
distinct_model_years = df['model year'].unique()
# Print the distinct model years
print("Distinct model years:", distinct_model_years)
Distinct model years: [70 71 72 73 74 75 76 77 78 79 80 81 82]
Observation:
The year data is displayed as a two-digit number (70-82), so it most likely represents years in the 20th century, i.e., 1970-1982.
For better readability, I'll convert the two-digit years into four-digit years.
#create a rule that adds 1900 to the two-digit years to infer the full year.
# Convert two-digit years into full four-digit years by assuming 1900s
df['model year'] = df['model year'].apply(lambda x: 1900 + x) # This applies a function to each value in the year column, adding 1900 to the two-digit year to convert it into the full year.
# Display the updated DataFrame
print(df)
mpg cylinders displacement horsepower weight acceleration \
0 18.0 8 307.0 130 3504 12.0
1 15.0 8 350.0 165 3693 11.5
2 18.0 8 318.0 150 3436 11.0
3 16.0 8 304.0 150 3433 12.0
4 17.0 8 302.0 140 3449 10.5
.. ... ... ... ... ... ...
393 27.0 4 140.0 86 2790 15.6
394 44.0 4 97.0 52 2130 24.6
395 32.0 4 135.0 84 2295 11.6
396 28.0 4 120.0 79 2625 18.6
397 31.0 4 119.0 82 2720 19.4
model year car name
0 1970 chevrolet chevelle malibu
1 1970 buick skylark 320
2 1970 plymouth satellite
3 1970 amc rebel sst
4 1970 ford torino
.. ... ...
393 1982 ford mustang gl
394 1982 vw pickup
395 1982 dodge rampage
396 1982 ford ranger
397 1982 chevy s-10
[398 rows x 8 columns]
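The simple +1900 rule works here because every model year falls in the 70-82 range. For data that might span centuries, a pivot-based conversion is common; a minimal sketch (the pivot value of 50 is an illustrative assumption, not something from this dataset):

```python
def expand_two_digit_year(yy, pivot=50):
    """Map a two-digit year to four digits: values >= pivot go to the 1900s, else the 2000s."""
    return 1900 + yy if yy >= pivot else 2000 + yy

print(expand_two_digit_year(70))  # 1970
print(expand_two_digit_year(5))   # 2005
```

With pivot=50, every year in this dataset (70-82) still maps to the 1970s/80s, so the result matches the simpler rule.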
# We utilize box plots and histograms to visualize key numerical data, which can be essential for identifying the best groups of vintage cars.
# Function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # X-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # Creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # Boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
histogram_boxplot(df,'model year',kde=True,bins=13)
# The acceleration column represents the time (in seconds) it takes a car to go from 0 to 60 mph -- a common performance measure for evaluating speed and engine performance.
# A histogram of the acceleration column shows how the cars in the Automobile dataset are distributed based on their acceleration times.
# Peaks: a peak around a certain value (e.g., 15 seconds) means that many cars in the dataset have an acceleration time around that value.
# Skewness: a right skew (longer right tail) means more slower cars (higher acceleration times); a left skew means more cars with faster acceleration.
# Plot the histogram for 'acceleration'
plt.figure(figsize=(12,4))
sns.histplot(df['acceleration'],bins=24,edgecolor='black',kde=True)
# Calculate mean and add a vertical line for mean
plt.axvline(df['acceleration'].mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
# Add titles and labels
plt.title('Distribution of Vintage Cars by Acceleration Time')
plt.xlabel('Acceleration (sec)')
# Show the plot
plt.show()
# We utilize bar plots to visualize categorical data, which can be essential for identifying the best groups of vintage cars.
# The car name column identifies the specific model of each car. It typically includes details such as the brand and model name, which are crucial for distinguishing between different vehicles.
# First, filter for car names that occur more than once
car_counts = df['car name'].value_counts()
top_cars = car_counts[car_counts > 1].nlargest(50).index
# Now, plot the count of these top car names
plt.figure(figsize=(20,5))
sns.countplot(data=df[df['car name'].isin(top_cars)], x='car name', order=top_cars)
plt.title('Top 50 Car Names with More Than One Occurrence')
plt.xticks(rotation=90)
plt.xlabel('Car Name')
plt.ylabel('Count')
plt.show()
The DataFrame has 398 rows/data points and 8 columns/features. Each row corresponds to a vintage car and its characteristics.
There are 305 unique vintage car names, which suggests that some cars appear more than once with different characteristics.
The most frequent car model in the dataset is the Ford Pinto, which appears 6 times. This might indicate that the Ford Pinto was a popular car during this period, but given the dataset size, it may also reflect the data source's preferences or availability.
Purpose: This is the process of validating the dataset to ensure it is logical, consistent, and usable.
Actions:
# Sanity checks
print("\nMissing Values:")
print(df.isnull().sum())
Missing Values:
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
car name        0
dtype: int64
print("The datatypes of the different columns in the dataset are as follows:")
df.info()
The datatypes of the different columns in the dataset are as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object
 4   weight        398 non-null    int64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64
 7   car name      398 non-null    object
dtypes: float64(3), int64(3), object(2)
memory usage: 25.0+ KB
1) There are no missing values in any of the columns in the dataset.
2) According to the info() function, we notice float64(3), int64(3), object(2) columns.
Note: Although the horsepower column contains integer values, the info() function indicates it is stored as an object data type, which needs to be corrected for proper numerical analysis.
# Check the current datatype of the horsepower column
print(df['horsepower'].dtype)
object
df.head()
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 1970 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 1970 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 1970 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 1970 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 1970 | ford torino |
print(df['horsepower'].unique()) # This will show any non-numeric values
['130' '165' '150' '140' '198' '220' '215' '225' '190' '170' '160' '95' '97' '85' '88' '46' '87' '90' '113' '200' '210' '193' '?' '100' '105' '175' '153' '180' '110' '72' '86' '70' '76' '65' '69' '60' '80' '54' '208' '155' '112' '92' '145' '137' '158' '167' '94' '107' '230' '49' '75' '91' '122' '67' '83' '78' '52' '61' '93' '148' '129' '96' '71' '98' '115' '53' '81' '79' '120' '152' '102' '108' '68' '58' '149' '89' '63' '48' '66' '139' '103' '125' '133' '138' '135' '142' '77' '62' '132' '84' '64' '74' '116' '82']
Observation:
The output reveals a '?' entry among the values -- a placeholder for missing data. All other entries are numbers stored as strings, which is why the column was read as an object (string) type rather than as numbers.
# Replace '?' with NaN in the entire DataFrame
df.replace('?', np.nan, inplace=True)
# Convert the problematic column to numeric, forcing any non-numeric values to NaN
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
print(df['horsepower'].dtype)
float64
print(df.info()) # Should now show 'float64' or 'int64'
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64
 7   car name      398 non-null    object
dtypes: float64(4), int64(3), object(1)
memory usage: 25.0+ KB
None
Observation:
The output shows 'float64' as a datatype for column horsepower.
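The errors='coerce' behavior can be seen on a toy series; each '?' becomes NaN, which is why horsepower drops from 398 to 392 non-null values above. A small sketch (the values here are illustrative, not from the dataset):

```python
import pandas as pd

# A toy horsepower-like column with one '?' placeholder
s = pd.Series(['130', '165', '?', '150'])
converted = pd.to_numeric(s, errors='coerce')  # '?' cannot be parsed, so it becomes NaN

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```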
# Check for duplicated rows
duplicates = df.duplicated().sum()
# Show the duplicated rows
print(f"Number of duplicated rows: {duplicates}")
Number of duplicated rows: 0
Observation
No Duplicates = Higher Data Integrity: The absence of duplicates indicates good data quality, leading to more accurate and reliable insights from our analysis.
Better Machine Learning Performance: Not having duplicates ensures that the models won’t be biased toward over-represented car types.
# Find rows where 'car name' and 'model year' both repeat
duplicates = df[df.duplicated(subset=['car name', 'model year'],keep=False)]
print(duplicates)
mpg cylinders displacement horsepower weight acceleration \
168 23.0 4 140.0 83.0 2639 17.0
174 18.0 6 171.0 97.0 2984 14.5
338 27.2 4 135.0 84.0 2490 15.7
342 30.0 4 135.0 84.0 2385 12.9
model year car name
168 1975 ford pinto
174 1975 ford pinto
338 1981 plymouth reliant
342 1981 plymouth reliant
Variations in the Same Model: Even though the car name and model year are the same, manufacturers often offer different versions of the same model. This could be to cater to varying customer needs, such as those looking for fuel efficiency vs. power or performance.
Impact of Engine and Weight on Fuel Efficiency: The cars with higher displacement, more cylinders, and more horsepower tend to have lower mpg (fuel efficiency), while the lighter and less powerful versions have higher mpg.
Customization Options: These differences suggest that buyers may have had options to choose different configurations (engine size, weight, etc.) based on their preferences or needs.
This information can help in segmenting these cars into different audience groups based on what potential buyers value (e.g., fuel efficiency vs. performance).
# Select only numeric columns
numeric_df = df.select_dtypes(include=[np.number])
# Calculate correlation
correlation_matrix = numeric_df.corr()
print(correlation_matrix)
plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True,cmap='coolwarm')
mpg cylinders displacement horsepower weight \
mpg 1.000000 -0.775396 -0.804203 -0.778427 -0.831741
cylinders -0.775396 1.000000 0.950721 0.842983 0.896017
displacement -0.804203 0.950721 1.000000 0.897257 0.932824
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538
weight -0.831741 0.896017 0.932824 0.864538 1.000000
acceleration 0.420289 -0.505419 -0.543684 -0.689196 -0.417457
model year 0.579267 -0.348746 -0.370164 -0.416361 -0.306564
acceleration model year
mpg 0.420289 0.579267
cylinders -0.505419 -0.348746
displacement -0.543684 -0.370164
horsepower -0.689196 -0.416361
weight -0.417457 -0.306564
acceleration 1.000000 0.288137
model year 0.288137 1.000000
Observations:
mpg is strongly negatively correlated with cylinders (-0.78), displacement (-0.80), horsepower (-0.78), and weight (-0.83): heavier, more powerful cars are less fuel-efficient.
cylinders, displacement, horsepower, and weight are strongly positively correlated with one another (0.84-0.95), indicating substantial multicollinearity -- a good motivation for dimensionality reduction with PCA.
acceleration is moderately negatively correlated with horsepower (-0.69), and model year is moderately positively correlated with mpg (0.58), suggesting fuel efficiency improved over the years.
plt.figure(figsize=(20, 10))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(2, 5, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations on Outliers in the Numerical Features:
# The plt.boxplot() function cannot handle NaN values directly, so if there are any NaN values in the column, the plot might not display correctly, or it could be blank.
# Solution: Remove or Handle NaN Values Before Plotting
# We will handle the NaN values with a reasonable statistic (like the mean or median) before plotting. Since horsepower is a critical feature for analyzing and grouping vintage cars, we will carefully handle these missing values to ensure the data is accurate and complete.
# Fill missing values with the median
median_value = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(median_value)  # assignment avoids the chained inplace-on-a-column pitfall
# Plot boxplot after filling missing values
plt.figure(figsize=(20, 10))
plt.boxplot(df['horsepower'], whis=1.5)
plt.title('Boxplot of Horsepower')
plt.show();
Observations on the boxplot display of the horsepower feature:
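The outliers the boxplot flags visually can be counted with the same 1.5×IQR rule that the whiskers (whis=1.5) use. A sketch on a small synthetic sample (the values are illustrative, not horsepower data):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values falling outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - whis * iqr) | (s > q3 + whis * iqr)]

sample = pd.Series([88, 90, 92, 95, 97, 100, 230])  # one extreme value
print(iqr_outliers(sample).tolist())  # [230]
```

Applied to df['horsepower'], this would give an exact count of the points plotted beyond the whiskers.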
# 1. Descriptive Statistics
df.describe()
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | |
|---|---|---|---|---|---|---|---|
| count | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.304020 | 2970.424623 | 15.568090 | 1976.010050 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.222625 | 846.841774 | 2.757689 | 3.697627 |
| min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 | 1970.000000 |
| 25% | 17.500000 | 4.000000 | 104.250000 | 76.000000 | 2223.750000 | 13.825000 | 1973.000000 |
| 50% | 23.000000 | 4.000000 | 148.500000 | 93.500000 | 2803.500000 | 15.500000 | 1976.000000 |
| 75% | 29.000000 | 8.000000 | 262.000000 | 125.000000 | 3608.000000 | 17.175000 | 1979.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 1982.000000 |
# 2. Distribution Analysis
# Histograms: To visualize the distribution of numeric features (e.g., normal distribution, skewness).
# Density Plots: To visualize the probability distribution of continuous data.
plt.figure(figsize=(20, 10))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(2, 5, i + 1)
    df[variable].hist()
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations:
The distribution plots indicate that the variables mpg, displacement, horsepower, and weight exhibit a significant right skew.
The acceleration variable shows a moderate right skew.
The model year variable appears to have a relatively uniform distribution, with several peaks corresponding to specific years.
The cylinder variable presents a non-continuous distribution, displaying peaks at certain cylinder counts while showing lower frequencies at others.
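The skew read off the histograms can be checked numerically with pandas' sample skewness: positive values indicate a right (long upper) tail, negative values a left tail. A sketch on synthetic data, since the point is the method rather than the exact values:

```python
import pandas as pd

right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 10])  # long upper tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

print(right_skewed.skew() > 0)  # True -- positive skewness, right tail
print(symmetric.skew())         # 0.0 -- perfectly symmetric
```

On the real data, df[numeric_columns].skew() would quantify the right skew noted above for mpg, displacement, horsepower, and weight.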
# 3. Missing Values Analysis
df.isnull().sum()
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
car name        0
dtype: int64
Observation: There is no missing data remaining (the horsepower NaNs were filled with the median earlier).
# 5. Categorical Feature Analysis
plt.figure(figsize = (40, 5))
df['car name'].value_counts()
sns.countplot(x='car name', data=df)
plt.xticks(rotation = 90)
plt.show()
Observation:
Frequency Distribution: The count plot reveals the distribution of different car names in the dataset, allowing us to identify which models are most and least prevalent.
Popular Car Models: Certain car names are significantly more common than others, indicating that these models may have higher sales or popularity during the time frame of the dataset. Ford Pinto has the maximum count.
Insights on Trends: This analysis could provide insights into market trends, brand popularity, and consumer preferences for specific car models within the dataset.
#8. Feature Relationships (Bivariate Analysis)
plt.figure(figsize = (40, 5))
sns.scatterplot(x='car name', y='model year', data=df)
plt.xticks(rotation = 90)
plt.show()
Insights:
Trends Over Time: The scatter plot visualizes how different car models relate to their manufacturing years, indicating trends in the automotive industry over time. This can help identify which models were popular in specific decades.
Concentration of Models: There may be clusters of car names corresponding to certain model years, suggesting that specific models were produced during particular time periods. This can indicate brand strategies or market demands.
Variation in Production Years: Some car names may show a broader range of production years, while others may have a more concentrated production span, hinting at model longevity or discontinuation.
Discontinuation of Models: A noticeable gap in the scatter plot for certain car names could indicate models that were discontinued after a certain year, pointing to changing consumer preferences or manufacturer strategies.
Market Evolution: The plot can help illustrate the evolution of car design and technology over the years, with certain car names reflecting historical or stylistic changes in the industry.
sns.pairplot(df)
df.head(1)
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970 | chevrolet chevelle malibu |
Pairwise Relationships:
The pair plot provides a visual representation of the relationships between each pair of features in the dataset. By examining these scatter plots, we can identify potential correlations or patterns between variables.
Why do we need to scale the data? Features like weight (in the thousands) and mpg (in the tens) sit on very different scales, so distance- and variance-based techniques such as K-Means, PCA, and t-SNE would otherwise be dominated by the largest-scale features.
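To illustrate why scaling matters, consider the Euclidean distance between two hypothetical cars described by just [mpg, weight] (illustrative numbers, not rows from the dataset): one car has twice the mpg of the other, yet the raw distance is driven almost entirely by the weight difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two cars: [mpg, weight]. mpg differs by 2x, weight by only ~6%.
cars = np.array([[15.0, 3500.0],
                 [30.0, 3700.0]])

raw_dist = np.linalg.norm(cars[0] - cars[1])
print(round(raw_dist, 1))  # 200.6 -- the weight gap of 200 lbs dominates

scaled = StandardScaler().fit_transform(cars)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(round(scaled_dist, 2))  # 2.83 -- after scaling, both features contribute equally
```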
# Scaling the data before clustering; it is performed on all the numerical features in the dataset.
# df.iloc[:, :7] selects all rows and the first 7 columns (indices 0-6), i.e., every column except car name.
# copy() ensures we work with a separate copy of this subset, not just a reference to the original DataFrame.
scaler = StandardScaler()
subset = df.iloc[:,:7].copy()
subset_scaled = scaler.fit_transform(subset)
subset_scaled
array([[-0.7064387 , 1.49819126, 1.0906037 , ..., 0.63086987,
-1.29549834, -1.62742629],
[-1.09075062, 1.49819126, 1.5035143 , ..., 0.85433297,
-1.47703779, -1.62742629],
[-0.7064387 , 1.49819126, 1.19623199, ..., 0.55047045,
-1.65857724, -1.62742629],
...,
[ 1.08701694, -0.85632057, -0.56103873, ..., -0.79858454,
-1.4407299 , 1.62198339],
[ 0.57460104, -0.85632057, -0.70507731, ..., -0.40841088,
1.10082237, 1.62198339],
[ 0.95891297, -0.85632057, -0.71467988, ..., -0.29608816,
1.39128549, 1.62198339]])
# Creating a dataframe of the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns = subset.columns)
subset_scaled_df
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | |
|---|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | -1.627426 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | -1.627426 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | -1.627426 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | -1.627426 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | -1.627426 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 0.446497 | -0.856321 | -0.513026 | -0.479482 | -0.213324 | 0.011586 | 1.621983 |
| 394 | 2.624265 | -0.856321 | -0.925936 | -1.370127 | -0.993671 | 3.279296 | 1.621983 |
| 395 | 1.087017 | -0.856321 | -0.561039 | -0.531873 | -0.798585 | -1.440730 | 1.621983 |
| 396 | 0.574601 | -0.856321 | -0.705077 | -0.662850 | -0.408411 | 1.100822 | 1.621983 |
| 397 | 0.958913 | -0.856321 | -0.714680 | -0.584264 | -0.296088 | 1.391285 | 1.621983 |
398 rows × 7 columns
# "Similarly, we can use z-score normalization or standardization to standardize the `horsepower` column (or any column) in the DataFrame. This process scales the data in the same way as the `StandardScaler()` function does."
(df['horsepower']-df['horsepower'].mean()) /df['horsepower'].std()
0 0.672271
1 1.587959
2 1.195522
3 1.195522
4 0.933897
...
393 -0.478879
394 -1.368405
395 -0.531204
396 -0.662017
397 -0.583529
Name: horsepower, Length: 398, dtype: float64
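Note that the first value here (0.672271) differs slightly from the StandardScaler output above (0.673118). That is because StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample standard deviation (ddof=1). A sketch on toy values showing the two match once ddof=0 is used:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

s = pd.Series([130.0, 165.0, 150.0, 140.0, 104.0])  # illustrative horsepower-like values

manual = (s - s.mean()) / s.std(ddof=0)  # population std, as StandardScaler uses
sklearn_z = StandardScaler().fit_transform(s.to_frame()).ravel()

print(np.allclose(manual.to_numpy(), sklearn_z))  # True
```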
PCA (Principal Component Analysis) helps when some of the features in the data are highly correlated, which can cause problems: correlated features confuse algorithms like clustering because they provide duplicate information.
PCA solves this by finding new "important" features that capture the most variation (differences) in the data. The new features created by PCA, called principal components, are uncorrelated with each other, which removes the overlap of information.
This makes clustering more accurate and easier because it focuses on the main patterns in the data without any confusion from similar features.
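Under the hood, PCA amounts to an eigendecomposition of the covariance matrix of the centered data, equivalently a singular value decomposition (SVD). A minimal sketch on random data (not the car dataset) showing that a hand-rolled SVD reproduces sklearn's explained variance ratios:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))

# PCA by SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var = S**2 / (X.shape[0] - 1)     # variance captured along each component
ratio_svd = var / var.sum()

ratio_sklearn = PCA().fit(X).explained_variance_ratio_
print(np.allclose(ratio_svd, ratio_sklearn))  # True
```

The rows of Vt are the principal directions, matching sklearn's components_ up to sign.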
# Importing PCA
from sklearn.decomposition import PCA
# Defining the number of principal components to generate
n = subset.shape[1] # Storing the number of variables in the data
pca = PCA(n_components = n, random_state = 1) # Storing PCA function with n components
data_pca = pd.DataFrame(pca.fit_transform(subset_scaled_df )) # Applying PCA on scaled data
# Renaming the columns of the PCA DataFrame
data_pca.columns = [f'PC{i+1}' for i in range(data_pca.shape[1])]
# The percentage of variance explained by each principal component is stored
exp_var = (pca.explained_variance_ratio_)
exp_var
array([0.71476787, 0.1236554 , 0.10414042, 0.02671968, 0.01778617,
0.00790889, 0.00502158])
# Plotting the explained cumulative variance by principal components
plt.figure(figsize = (10, 10))
plt.plot(range(1, len(exp_var) + 1), pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
# Summing the explained variance ratios of the first 3 components gives the total variance explained by these components.
sum(pca.explained_variance_ratio_[:3])
0.9425636832481039
Observations: Out of the 7 original numerical features, PCA lets us keep just the first 3 principal components, which together explain approximately 94% of the original variance. That is roughly a 57% reduction in the dimensionality of the dataset with a loss of less than 6% of the variance.
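Rather than reading the component count off the cumulative-variance plot by eye, sklearn's PCA accepts a float n_components and picks the smallest number of components reaching that variance threshold. A sketch on synthetic correlated data (3 latent factors mixed into 7 features; the 0.94 threshold mirrors the ~94% retained above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Build 7 correlated features from 3 latent factors, plus a little noise
X = base @ rng.normal(size=(3, 7)) + 0.05 * rng.normal(size=(200, 7))

pca = PCA(n_components=0.94)  # keep the fewest components covering >= 94% variance
pca.fit(X)
print(pca.n_components_)                            # at most 3 -- only 3 latent factors
print(pca.explained_variance_ratio_.sum() >= 0.94)  # True
```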
# New Feature Space
data_pca
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | |
|---|---|---|---|---|---|---|---|
| 0 | 2.661556 | -0.918577 | -0.558420 | 0.740000 | -0.549433 | 0.089079 | -0.118566 |
| 1 | 3.523307 | -0.789779 | -0.670658 | 0.493223 | -0.025134 | -0.203588 | 0.101518 |
| 2 | 2.998309 | -0.861604 | -0.982108 | 0.715598 | -0.281324 | -0.137351 | -0.055167 |
| 3 | 2.937560 | -0.949168 | -0.607196 | 0.531084 | -0.272607 | -0.295916 | -0.121296 |
| 4 | 2.930688 | -0.931822 | -1.078890 | 0.558607 | -0.543871 | -0.007707 | -0.167301 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | -1.420970 | 1.225252 | -0.286402 | -0.671666 | 0.054472 | 0.187878 | 0.101922 |
| 394 | -4.094686 | 1.279998 | 1.960384 | 1.375464 | 0.740606 | -0.175097 | 0.087391 |
| 395 | -1.547254 | 1.252540 | -1.906999 | -0.323768 | -0.255922 | 0.254531 | 0.149028 |
| 396 | -2.022942 | 1.132137 | 0.609384 | -0.464327 | 0.186656 | -0.089169 | 0.075018 |
| 397 | -2.182941 | 1.236714 | 0.787268 | -0.157523 | 0.468128 | -0.025712 | 0.001612 |
398 rows × 7 columns
# Get the principal components (loadings or coefficients)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
loadings_df = pd.DataFrame(loadings, index=subset.columns, columns=[f'PC{i+1}' for i in range(loadings.shape[1])])
# View the loadings
print(loadings_df)
PC1 PC2 PC3 PC4 PC5 PC6 \
mpg -0.890788 0.196955 -0.219344 0.324360 0.117137 0.053858
cylinders 0.932776 0.178494 0.120355 0.209194 -0.171294 -0.080622
displacement 0.962400 0.165281 0.088607 0.127081 -0.021883 0.034420
horsepower 0.945823 0.084143 -0.143674 -0.014914 0.256575 -0.118411
weight 0.927712 0.206761 0.239082 -0.048986 0.086248 0.165260
acceleration -0.637912 -0.022460 0.763104 0.055402 0.083880 -0.050522
model year -0.513836 0.848212 -0.015481 -0.129214 -0.031144 -0.032122
PC7
mpg -0.016776
cylinders -0.080618
displacement 0.152716
horsepower -0.011897
weight -0.069349
acceleration 0.009839
model year 0.009448
Observations:
Dimensionality Reduction: PCA transforms the original dataset with many features into fewer, uncorrelated principal components, while still retaining as much of the variance (information) as possible. The first few components (PC1, PC2, PC3) explain the most variance.
New Feature Space: The numbers in the output table are the values of the original data points projected onto these new components; they represent each data point in terms of the principal components.
Summing the explained variance ratios of the first 3 components gives the total variance explained by these components (about 94%).
Principal Component Loadings provide insight into which original features are driving the variance in your data. In other words it shows which features are the most important in explaining the variance captured by each component. Positive loadings indicate a positive correlation with the component, while negative loadings indicate a negative correlation.
Interpreting a few values from loadings_df:
We see PC1 has a negative loading on the mpg feature. This means that as the miles per gallon (mpg) of a car increases, the value of PC1 decreases; cars with better fuel efficiency (higher mpg) are associated with lower values of PC1.
Similarly, PC1 has a positive loading on the cylinders feature: as the number of cylinders increases, the value of PC1 also increases, so cars with more cylinders are associated with higher values of PC1.
By understanding how features influence principal components, analysts can make informed choices that lead to more effective clustering and better overall insights from the data.
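The loading formula used above, components_.T * sqrt(explained_variance_), has a direct interpretation: when the input features are standardized (with sample std, ddof=1), each loading equals the Pearson correlation between that feature and the component scores. A sketch verifying this on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # z-score with sample std

pca = PCA().fit(Z)
scores = pca.transform(Z)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Correlation between feature 0 and the first component's scores
corr = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
print(np.isclose(corr, loadings[0, 0]))  # True
```

(In the notebook the data were scaled with StandardScaler, which uses ddof=0, so the loadings there are correlations only up to a factor of sqrt(n/(n-1)) ≈ 1.001 -- negligible at 398 rows.)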
# Calculate loadings for the first three principal components
loadings = pca.components_.T[:, :3] * np.sqrt(pca.explained_variance_[:3])
# Create a DataFrame for better visualization, if needed
loadings_df = pd.DataFrame(loadings,index=subset.columns,columns=[f'PC{i+1}' for i in range(3)])
# Display the loadings for the first three principal components
print(loadings_df)
                   PC1       PC2       PC3
mpg          -0.890788  0.196955 -0.219344
cylinders     0.932776  0.178494  0.120355
displacement  0.962400  0.165281  0.088607
horsepower    0.945823  0.084143 -0.143674
weight        0.927712  0.206761  0.239082
acceleration -0.637912 -0.022460  0.763104
model year   -0.513836  0.848212 -0.015481
Observations:
1) Principal Component 1 (PC1):
High positive loadings: Cylinders (0.932776), Displacement (0.962400), Horsepower (0.945823), and Weight (0.927712) all have strong positive coefficients, suggesting that PC1 is primarily influenced by these features.
Negative loadings: Miles per Gallon (mpg, -0.890788) and Acceleration (-0.637912) have negative coefficients, indicating that as the values of these features increase, the value of PC1 decreases.
Insight: PC1 seems to represent a performance axis of vehicles where higher horsepower, weight, and displacement correlate with lower fuel efficiency (mpg) and acceleration. This could imply that higher performance vehicles tend to have larger engines and are heavier but are less fuel-efficient.
2) Principal Component 2 (PC2):
High positive loading: Model Year (0.848212) has a strong positive loading, suggesting that more recent models contribute positively to PC2.
Moderate positive loadings: Weight (0.206761) and Cylinders (0.178494) have smaller positive coefficients, indicating a lesser but still notable influence.
Low loadings for others: the remaining features have low or near-zero loadings, indicating they do not contribute significantly to this component.
Insight: PC2 may represent a temporal aspect where newer cars tend to have different characteristics compared to older models. This could be related to advancements in technology, design, or regulatory changes affecting vehicle performance.
3) Principal Component 3 (PC3):
High positive loading: Acceleration (0.763104) has a strong positive loading, indicating that higher acceleration values contribute positively to this component.
Moderate positive loading: Weight (0.239082) also shows a moderate positive influence, suggesting a relationship between acceleration and vehicle weight.
Negative loadings: Horsepower (-0.143674) and mpg (-0.219344) have negative coefficients, indicating that as horsepower and fuel efficiency increase, the value of PC3 decreases.
Insight: PC3 appears to capture a balance between acceleration and power efficiency. This could indicate that cars that accelerate quickly may not necessarily be the most powerful or fuel-efficient.
Overall Insights
1) Understanding Clusters:
These components can help identify distinct groups (or clusters) of vehicles based on performance, efficiency, and modernity. For instance, you could cluster vehicles that are high in PC1 and low in PC2 to find older high-performance vehicles.
2) Performance vs. Efficiency: The strong negative correlation between performance features (like horsepower and weight) and mpg in PC1 suggests a trade-off between performance and fuel efficiency. This insight can be useful for manufacturers focusing on performance vehicles while also considering environmental regulations.
3) Market Segmentation: Insights from PC2 can help marketers target different consumer segments, emphasizing modern features for newer vehicles while highlighting performance attributes for older models.
4) Product Development: Understanding how features contribute to these principal components can guide product development. For example, if a new model is being developed, balancing weight and horsepower could be key for achieving desirable acceleration and fuel efficiency metrics.
5) Future Considerations: As technology evolves, the relationship between these features may change. Continual analysis using PCA can help monitor trends in vehicle performance, efficiency, and consumer preferences over time.
In summary, the coefficients of the principal components provide a deeper understanding of how features interact and influence vehicle characteristics, which can be invaluable for clustering analysis, market strategy, and product development in the automotive industry.
# Applying PCA
pca = PCA(n_components=2) # Set the number of components you want
data_pca1 = pd.DataFrame(pca.fit_transform(subset_scaled_df)) # Applying PCA
# Renaming the columns of the PCA DataFrame
data_pca1.columns = [f'PC{i+1}' for i in range(data_pca1.shape[1])]
# Display the PCA DataFrame with labeled columns
print(data_pca1.head())
| | PC1 | PC2 |
|---|---|---|
| 0 | 2.661556 | -0.918577 |
| 1 | 3.523307 | -0.789779 |
| 2 | 2.998309 | -0.861604 |
| 3 | 2.937560 | -0.949168 |
| 4 | 2.930688 | -0.931822 |
# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(data_pca1['PC1'], data_pca1['PC2'], marker='o')
# Adding labels and title
plt.title('PCA - First Two Principal Components')
plt.xlabel('Principal Component 1 (PC1)')
plt.ylabel('Principal Component 2 (PC2)')
plt.grid()
# Show the plot
plt.axhline(0, color='gray', lw=0.5, ls='--')
plt.axvline(0, color='gray', lw=0.5, ls='--')
plt.show()
Observations:
# Fitting t-SNE with number of components equal to 2
tsne = TSNE(n_components = 2, random_state = 1)
data_tsne = tsne.fit_transform(subset_scaled)
# Converting the embeddings to a dataframe
data_tsne = pd.DataFrame(data_tsne, columns = ['TSNE_X1', 'TSNE_X2'])
# Scatter plot for two components
plt.figure(figsize = (7,7))
sns.scatterplot(x = 'TSNE_X1', y = 'TSNE_X2', data = data_tsne)
plt.show()
# Fitting t-SNE with number of components equal to 3
tsne = TSNE(n_components = 3, random_state = 1)
data_tsne = tsne.fit_transform(subset_scaled)
data_tsne = pd.DataFrame(data_tsne, columns = ['X1', 'X2', 'X3'])
# Scatter plot for all three components
fig = plt.figure(figsize = (12, 12))
ax = fig.add_subplot(111, projection = '3d')
x = data_tsne['X1']
y = data_tsne['X2']
z = data_tsne['X3']
ax.scatter(x, y, z)
plt.show()
Observations:
We know that t-SNE preserves the local structure of the data while embedding it from high dimensions to low dimensions. Here, we have generated the 2D and 3D embeddings for the data. We can clearly see 3 groups in the data; the points are tightly clustered within each group, with few outliers.
# Let's try to visualize the data for different perplexity values
for i in range(10, 50, 5):
    tsne = TSNE(n_components=2, random_state=1, perplexity=i)
    data_tsne = tsne.fit_transform(subset_scaled)
    data_tsne = pd.DataFrame(data_tsne, columns=['X1', 'X2'])
    plt.figure(figsize=(7, 7))
    sns.scatterplot(x='X1', y='X2', data=data_tsne)
    plt.title("perplexity = {}".format(i))
    plt.show()
Observations: We observe that some perplexity values, like 35 and 45, capture the underlying patterns in the data better than other values. This shows that perplexity plays an important role in t-SNE implementation. Let's visualize again with perplexity equal to 45, as it yields 3 clear groups which are distant from each other, i.e., well separated.
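Beyond visual inspection, scikit-learn exposes the final Kullback-Leibler divergence of each fitted embedding via `TSNE.kl_divergence_`; lower values indicate the low-dimensional embedding preserves the high-dimensional neighborhoods more faithfully, which can complement eyeballing the plots when comparing perplexities. A minimal sketch (synthetic data stands in for `subset_scaled`):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: in the notebook this would be the scaled feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 7))

# Compare the final KL divergence across a few perplexity values.
for perp in (10, 25, 35):
    tsne = TSNE(n_components=2, random_state=1, perplexity=perp)
    tsne.fit_transform(X)
    print(f"perplexity={perp}: KL divergence = {tsne.kl_divergence_:.3f}")
```

Note that KL divergence is not directly comparable across very different perplexities, so it is best used as a tie-breaker alongside the visual check above.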
# Fitting t-SNE with number of components equal to 2
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 45)
data_tsne = tsne.fit_transform(subset_scaled)
# Converting the embeddings to a dataframe
data_tsne = pd.DataFrame(data_tsne, columns = ["X1", "X2"])
# Scatter plot for two components
plt.figure(figsize = (7, 7))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
plt.show()
Observations: We can clearly see 3 groups in the data. Let's label these 3 groups using the values of the X1 and X2 axes.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data_tsne' is your DataFrame with 'X1' and 'X2' as the t-SNE components
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=1, n_init=10)  # fixed seed for reproducible cluster labels
data_tsne['cluster'] = kmeans.fit_predict(data_tsne[['X1', 'X2']])
# Plotting the clusters with different colors
plt.figure(figsize=(8, 8))
sns.scatterplot(x='X1', y='X2', hue='cluster', data=data_tsne, palette='Set1', s=60)
plt.title('t-SNE Clusters')
plt.show()
Observations:
The plot shows clear separation between the three clusters (marked in red, green, and blue).
This indicates that the data points in each cluster are more similar to each other than to those in other clusters.
1) Red Cluster (Cluster 0):
This cluster contains the majority of the points and is concentrated on the left side of the plot, spanning from approximately X1 = -20 to X1 = -10 and ranging vertically from X2 = -10 to X2 = 10.
The points here might represent a certain group with unique characteristics (e.g., lower values on the x-axis).
2) Green Cluster (Cluster 1):
This cluster appears to be a smaller group located centrally, with points primarily between X1 = 0 to X1 = 10 and X2 = -5 to X2 = 5.
This indicates that these data points may share similar features that differentiate them from the others.
3) Blue Cluster (Cluster 2):
The blue cluster is distinct and spread towards the right side of the plot, around X1 = 20 to X1 = 30 and mostly in the range of X2 = -5 to X2 = 0.
This suggests another unique group with specific properties, potentially indicating higher values along the x-axis.
There appear to be some isolated points, especially in the green and blue clusters. This could indicate outliers or unique observations that may need further investigation.
The distribution of data points indicates varying densities among clusters. For instance, the red cluster seems to have a higher density of points, while the green cluster has a more scattered arrangement.
Each cluster could represent different categories, types, or behaviors in your dataset, warranting further analysis to understand the underlying factors contributing to these separations.
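One compact way to investigate these separations numerically is to average each (scaled) feature within each cluster. A hedged sketch: in the notebook this would use `subset_scaled_df` and the labels in `data_tsne['cluster']`; synthetic data is used here as a stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in data: three hypothetical scaled features in place of the
# notebook's subset_scaled_df.
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(90, 3)),
                        columns=['mpg', 'horsepower', 'weight'])
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(features)

# Mean of each feature per cluster: a numeric complement to the
# per-feature boxplots.
cluster_profile = features.groupby(labels).mean()
print(cluster_profile)
```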
subset_scaled_df
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | -1.627426 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | -1.627426 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | -1.627426 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | -1.627426 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | -1.627426 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 0.446497 | -0.856321 | -0.513026 | -0.479482 | -0.213324 | 0.011586 | 1.621983 |
| 394 | 2.624265 | -0.856321 | -0.925936 | -1.370127 | -0.993671 | 3.279296 | 1.621983 |
| 395 | 1.087017 | -0.856321 | -0.561039 | -0.531873 | -0.798585 | -1.440730 | 1.621983 |
| 396 | 0.574601 | -0.856321 | -0.705077 | -0.662850 | -0.408411 | 1.100822 | 1.621983 |
| 397 | 0.958913 | -0.856321 | -0.714680 | -0.584264 | -0.296088 | 1.391285 | 1.621983 |
398 rows × 7 columns
# Create boxplots for each feature to compare how they are distributed across the identified clusters
all_col = subset_scaled_df.columns[:7].tolist()
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=subset_scaled_df[variable], x=data_tsne['cluster'])
    plt.title(variable)
plt.tight_layout()
plt.show()
Observations:
1) MPG (Miles per Gallon):
Cluster 0: Represents cars with relatively high fuel efficiency, a key characteristic of lighter vehicles or cars with smaller engines.
Cluster 1: Cars with lower fuel efficiency, possibly due to larger, high-displacement engines or older technology.
Cluster 2: A mix of vehicles with moderate but more consistent fuel efficiency.
2) Cylinders:
Cluster 0: Contains vehicles with fewer cylinders, likely smaller engines.
Cluster 1: A wider distribution of cylinder counts, indicating the presence of both smaller- and larger-engine cars.
Cluster 2: Homogeneous in cylinder count, possibly a distinct group of similar engine types.
3) Displacement:
Cluster 0: Low engine displacement, typically associated with more fuel-efficient vehicles.
Cluster 1: High engine displacement, representing cars with larger engines and likely lower fuel efficiency.
Cluster 2: Moderate displacement, possibly representing mid-sized vehicles.
4) Horsepower:
Cluster 1: Cars with significantly higher horsepower, representing powerful vehicles with larger engines.
Clusters 0 and 2: Lower horsepower distributions, possibly representing cars with smaller engines or older models.
5) Weight:
Cluster 1: Heavier cars, which may correlate with larger engines and lower fuel efficiency.
Cluster 0: Lighter vehicles, consistent with higher fuel efficiency.
Cluster 2: Mid-range vehicles in terms of weight.
6) Acceleration:
Cluster 0: Faster acceleration, possibly due to lighter weight and smaller engines.
Cluster 1: Slower acceleration, consistent with the heavier weight and higher horsepower of this group.
Cluster 2: Similar to Cluster 0, but with some variation in performance.
7) Model Year:
Cluster 0: Consists of older cars, possibly early vintage models.
Cluster 1: Newer vintage cars with slightly more advanced features.
Cluster 2: A mix of years, including some older models.
Conclusion
Through this analysis, we identified three distinct clusters of vintage cars, each representing a unique segment of vehicles:
Cluster 0: Likely represents older, lighter, and more fuel-efficient vehicles, making them appealing to customers interested in vintage cars with economical performance.
Cluster 1: Contains heavier, more powerful vehicles with high horsepower, catering to enthusiasts of vintage muscle cars or high-performance vehicles.
Cluster 2: Represents a middle-ground cluster with moderate features, appealing to a broad range of vintage car buyers.
These insights can help SecondLife target specific customer segments based on vehicle characteristics, improving the dealership’s marketing strategy and sales approach.
High Correlation Among Features: Features such as horsepower, weight, and displacement show strong correlations with miles per gallon (mpg). This indicates that optimizing these factors could significantly enhance fuel efficiency.
Categorical Feature Trends: Certain car names demonstrate higher sales or popularity. Analyzing the features of these models could inform future marketing and production decisions.
Skewed Distributions: Variables like mpg and weight exhibit right skewness. Data transformation techniques may be necessary for more accurate modeling.
Distinct Clusters: The t-SNE analysis indicates distinct clusters within the data, suggesting that different customer segments may have unique preferences and behaviors.
Feature Optimization: Focus on reducing vehicle weight and enhancing horsepower through engineering improvements, as these are likely to improve mpg.
Targeted Marketing: Utilize insights from popular car names to tailor marketing strategies and promotions for similar models or features that resonate with consumers.
Data Preprocessing: Apply transformations (e.g., log transformation) to skewed variables to enhance model performance and interpretability.
Segmentation Strategies: Leverage the clustering insights to develop targeted marketing strategies for different customer segments based on their preferences and behaviors identified in the analysis.
By implementing these insights and recommendations, the organization can enhance vehicle design, optimize marketing efforts, and ultimately improve customer satisfaction and sales performance.
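To illustrate the data-preprocessing recommendation above, `np.log1p` compresses the long right tail of a skewed variable. A minimal sketch, using synthetic right-skewed data as a stand-in for columns like weight or mpg:

```python
import numpy as np
import pandas as pd

# Stand-in data: a lognormal sample mimics a right-skewed feature
# such as vehicle weight.
rng = np.random.default_rng(1)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=500), name='weight')

# log1p (log(1 + x)) reduces right skew while handling values near zero.
transformed = np.log1p(skewed)

print(f"skew before: {skewed.skew():.2f}, after: {transformed.skew():.2f}")
```

After such a transformation, distance-based methods like PCA, t-SNE, and K-Means are less dominated by the few extreme values in the tail.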
Performance Segment (Cluster 1):
Larger engines (high displacement, cylinders, and horsepower) and heavier cars. Lower fuel efficiency and acceleration. Likely represents high-performance vehicles such as muscle or sports cars.
PCA PC1 confirms this group as performance-focused, and t-SNE clusters highlight it clearly.
Fuel Efficiency Segment (Cluster 0):
Smaller, lighter cars with higher mpg and better acceleration. Represents more economical, fuel-efficient cars, which may appeal to environmentally-conscious customers or those looking for economical vintage cars.
PCA PC1 and t-SNE align this group as focused on fuel efficiency and light-weight characteristics.
Temporal Segment (Cluster 2):
Newer models of cars with moderate performance, engine size, and weight. This segment may reflect technological improvements over time, and PC2 captures this temporal trend well.
Acceleration vs Power Efficiency Trade-off:
Certain cars prioritize acceleration over power, reflected in PC3, and this insight is visible in the t-SNE clustering as well. These could be marketed as sporty, quick cars despite not having the highest horsepower.
Performance Cars (Cluster 1): Target high-performance enthusiasts interested in vintage muscle cars or sports cars. They may be willing to compromise on fuel efficiency for power and design.
Economical, Fuel-Efficient Cars (Cluster 0): Focus on customers interested in environmentally friendly, efficient vehicles. These cars would appeal to buyers who value economical usage or have an interest in light, classic models.
Technological or Modern Vintage Cars (Cluster 2): Appeal to customers looking for newer vintage models that offer a blend of performance and technological advancements.
By using these insights, SecondLife can tailor its marketing strategy to focus on each group’s unique characteristics and maximize its appeal to specific customer segments.